Abstract - Banks and financial organizations face the 
challenge of identifying the threat factors that should be 
considered when advancing loans or credit to customers. 
An ensemble ML algorithm is suitable for studying a bank 
credit dataset. In future, we intend to build an automated 
ML-based risk assessment system on the cloud for financial 
organizations that will incorporate key features to 
determine the credit value of customers. 


Most methods for credit threat detection require 
historical data to build and validate models. Applying 
ML algorithms to credit threat determination and to 
building prediction models faces a major problem 
of data incompleteness. Most financial 
organizations do not share their information with 
other organizations, so determining the credibility of 
a customer is difficult. Another major issue faced by 
researchers in building models for threat detection is 
the presence of noise in the data. 


For this work, several ML techniques are explored 
and evaluated on real credit card datasets. Most ML 
methods achieved an accuracy of less than 81%. 
Finally, a predictive model for credit threat 
detection is proposed, based on an ensemble 
technique. The proposed model is evaluated on the basis 
of various performance metrics and compared 
with the base classifier (learner). It gave around 82% 
prediction accuracy. 
I. INTRODUCTION 


The data landscape has changed over the years. 
What you can or should do with data has 
changed. Storage costs have dropped 
dramatically as data collection continues to grow. 
Some data arrive quickly and constantly demand 
collection and observation. Other data arrive more 
slowly, but in very large blocks, often in the form of 
decades of historical data. The problem at hand may call for 
advanced analytics, or it may require machine learning. 

Credit value is represented as a credit score by financial 
organizations. A high credit score indicates high credit 
value. In addition, the assessment considers other factors such as age, health 
status, income, employment status, financial obligations, 
debt owed, accounts, length of payment history and the 
capability to repay debt. Banks also determine the 
interest rate, loan and other fees and fines, and the terms and 
conditions of a credit or loan on the basis of the score. Credit 
worthiness also impacts eligibility for employment, 
insurance, business funding and professional licenses or 
certifications. Credit value is the evaluation made by 
lenders of the possibility that a borrower may default 
on debt obligations. [1] Supervising liquidity risk 
and credit risk is one of the main issues of bank risk 
management. “Liquidity Risk” is the risk arising from the lack of 
marketability of an investment, when the underlying 
asset cannot be bought or sold quickly enough to 
prevent or mitigate a loss. “Credit Risk” is the 
main risk of holding a bond. [2] 


Management models including ‘Probability of Default’, 
Expected Loss etc. Here, in this work we find the 
‘Probability of Default’ of credit cards using a predictive 
model based on a neural network technique. The 
aspects of credit threat management are shown in Figure 
1.1. 
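
To make the relationship between ‘Probability of Default’ and Expected Loss concrete, a minimal sketch is given here. It uses the standard decomposition Expected Loss = PD x LGD x EAD. The loss-given-default (LGD) and exposure-at-default (EAD) terms, the function name and the example figures are assumptions introduced for illustration; they are not taken from the experiments in this work.

```python
def expected_loss(prob_default: float, lgd: float, ead: float) -> float:
    """Expected loss of one exposure: PD * LGD * EAD.

    prob_default: probability of default, in [0, 1]
    lgd: loss given default as a fraction of the exposure, in [0, 1] (assumed input)
    ead: exposure at default in currency units, >= 0 (assumed input)
    """
    if not 0.0 <= prob_default <= 1.0:
        raise ValueError("prob_default must lie in [0, 1]")
    if not 0.0 <= lgd <= 1.0:
        raise ValueError("lgd must lie in [0, 1]")
    if ead < 0.0:
        raise ValueError("ead must be non-negative")
    return prob_default * lgd * ead


# Hypothetical credit card account: 5% PD, 60% LGD, 10,000 outstanding.
loss = expected_loss(0.05, 0.60, 10_000.0)
print(f"Expected loss: {loss:.2f}")
```

In this decomposition, a predictive model of the kind described above would supply only the `prob_default` input; the other two inputs would come from the lender's own records.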
Figure 1.1: Credit Threat Management Aspects 


II. LITERATURE REVIEW 


The authors of [1] explained that Recursive Feature 
Elimination with Cross-Validation and Principal 
Component Analysis were used for dimensionality 
reduction. Metrics such as F1 score, AUC score, 
prediction accuracy, precision and recall were used 
to evaluate each model. Among all the models, the 
combination of a tuned Support Vector Machine (SVM) 
and Recursive Feature Elimination (RFE) with 
Cross-Validation showed great promise in identifying loan 
defaulters. They report that support vector machines can outperform 
tree-based models or regression models if the setup 
of the experiment is similar to theirs, and that recursive 
feature elimination with cross-validation can outperform 
models based on principal component analysis. For 
future improvements, they would like to use more current 
data from different sources to give a better 
understanding of the trends present in this field. 


The authors of [12] note that, among the long list of FinTechs, one of 
the most attractive platforms is Peer-to-Peer (P2P) 
lending, which aims to bring investors and borrowers 
together, leaving out traditional intermediaries 
like banks. The paper investigates machine learning 
techniques on big data platforms, analysing credit 
scoring methods. It concludes that in an HDFS 
(Hadoop Distributed File System) environment, Logistic 
Regression performs better than Decision Tree and 
Random Forest for credit scoring and classification, 
considering performance metrics such as accuracy, 
precision and recall, and the overall run time of the 
algorithms. 


Among the three methods, Logistic Regression has the 
best accuracy, precision and recall, compared to Decision 
Tree and Random Forest. The general belief 
that Logistic Regression and Random Forest are the most 
accepted and used methods for credit scoring was found 
to hold for HDFS as well: both Logistic Regression 
and Random Forest have better results than Decision 
Tree. In terms of model accuracy and precision, 
runs with more Data Nodes performed better than 
others, while the non-HDFS run performed almost as 
well as a three-node configuration. 
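
As an illustration of how Logistic Regression scores credit applicants, a minimal single-machine sketch is given here. It is not the HDFS-based implementation evaluated in [12]: the two features, the toy data, the learning rate and the number of epochs are all hypothetical values chosen only to keep the example small.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Estimated probability that applicant x belongs to class 1 (default)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X: List[List[float]], y: List[int],
                   lr: float = 0.1, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of log-loss w.r.t. the score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical features: [debt-to-income ratio, fraction of late payments].
X = [[0.1, 0.0], [0.2, 0.1], [0.3, 0.0], [0.7, 0.6], [0.8, 0.5], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]  # 1 = defaulted

w, b = train_logistic(X, y)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
print(preds)
```

The output of `predict_proba` can be read as a score: applicants above a chosen cut-off (0.5 here, an assumption) are classified as likely defaulters.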


In [6], the threat assessment method consists of threat 
identification and ranking. The power market settlement 
threats found by threat identification include data threat, 
credit threat, tax threat and policy threat, and the 
threat ranking is carried out using triangular fuzzy 
numbers to determine the degree of influence of the four 
threats on the settlement of the power market. 


A method based on triangular fuzzy numbers for threat 
factor ranking is used in that paper. This method can 
effectively compare the severity of settlement threat 
factors in the power market, and the analysis results can provide 
a basis for selecting the corresponding control measures 
in the power market more accurately and effectively. 
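
One common way to rank factors with triangular fuzzy numbers is sketched here. The exact procedure of [6] is not reproduced in this review, so the linguistic scale, the averaging of expert ratings, the centroid defuzzification and the sample ratings are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A triangular fuzzy number as (lower, modal, upper).
TFN = Tuple[float, float, float]

# Hypothetical linguistic scale mapped to triangular fuzzy numbers.
LOW: TFN = (0.0, 0.25, 0.5)
MEDIUM: TFN = (0.25, 0.5, 0.75)
HIGH: TFN = (0.5, 0.75, 1.0)


def mean_tfn(ratings: List[TFN]) -> TFN:
    """Aggregate several expert ratings by component-wise averaging."""
    n = len(ratings)
    return (sum(r[0] for r in ratings) / n,
            sum(r[1] for r in ratings) / n,
            sum(r[2] for r in ratings) / n)


def centroid(t: TFN) -> float:
    """Defuzzify a triangular fuzzy number by its centroid (l + m + u) / 3."""
    return (t[0] + t[1] + t[2]) / 3.0


def rank_threats(ratings_by_threat: Dict[str, List[TFN]]) -> List[Tuple[str, float]]:
    """Return (threat, crisp score) pairs sorted from most to least severe."""
    scores = {name: centroid(mean_tfn(r)) for name, r in ratings_by_threat.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical ratings from two experts for the four settlement threats.
ratings = {
    "data": [HIGH, MEDIUM],
    "credit": [HIGH, HIGH],
    "tax": [MEDIUM, LOW],
    "policy": [MEDIUM, MEDIUM],
}
ranking = rank_threats(ratings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Other defuzzification rules exist; the centroid is used here only because it is simple to state.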


The case analysis in the paper shows that when most of 
the linguistic fuzzification descriptions are selected 
during analysis, the values of the four power market 
settlement threats calculated by the triangular fuzzy 
numbers are not much different, which indicates that data 
threat, credit threat, tax threat and policy threat 
all significantly affect the security of power market 
settlements. 


2020 The authors. This is an open access article under the CC-BY license 


LI Changjian et al. [3] evaluate credit threat for rural credit 
cooperatives using an artificial neural network model. They 
establish a credit threat assessment index system for rural credit 
cooperatives, and then put forward a credit threat assessment 
model based on a particle swarm optimized neural network. 
They report that using neural network technology to 
identify credit threat can achieve a very high accuracy 
rate and cope with the many uncertain factors in credit. 
The model can provide a scientific reference for the 
credit policy and credit threat management of rural 
credit cooperatives. 


R. S. Ramya et al. [5] explain that the information gain measure is 
based on the entropy value of each specific feature. The amount of 
information gain or entropy is used to decide whether the 
feature is selected or deleted. Gain ratio applies a 
normalization technique to information gain using the split 
information value. Correlation-based feature 
selection uses heuristic search strategies to estimate how 
strongly the features are correlated with the class attribute and 
how independent they are of each other. Feature 
selection techniques such as information gain, gain ratio and 
chi-square correlation were applied to the German credit 
dataset available in the UCI Machine Learning Repository. 
These feature selection techniques selected the features 
that are useful for classifying clients, while the 
ones that are irrelevant or redundant were omitted. The 
performance, robustness and usefulness of data mining 
algorithms improve when relatively few, 
relevant features are involved in the process. 
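
The information gain and gain ratio measures described above can be sketched for categorical features as follows. The function names and the four-row toy dataset are hypothetical; the example does not use the German credit dataset.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Reduction in class entropy obtained by splitting on the feature."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain normalized by the split information of the feature."""
    split_info = entropy(feature)
    if split_info == 0.0:  # feature takes a single value and carries no information
        return 0.0
    return information_gain(feature, labels) / split_info


# Hypothetical toy data: one informative and one uninformative feature.
labels = ["good", "good", "bad", "bad"]
housing = ["own", "own", "rent", "rent"]  # separates the classes perfectly
branch = ["a", "b", "a", "b"]             # unrelated to the class

for name, column in [("housing", housing), ("branch", branch)]:
    print(name, information_gain(column, labels), gain_ratio(column, labels))
```

A feature would be kept when its score exceeds a chosen threshold and dropped otherwise; the threshold itself is a design choice not specified in the text.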


III. METHODOLOGY 


The methodology of the proposed work is explained with the 
help of Figure 3.1, which shows a screenshot of a real experiment 
performed on the Azure Machine Learning Workspace. 


The Azure ML platform allows several 
simulation parameters to be configured. 
The methodology for building the predictive model 
(classifier) is shown in Figure 3.1. 
Figure 3.1: Methodology of Proposed Work 


In the proposed model, a cloud computing platform is 
used to avoid computational limitations. Some filters 
are applied for better prediction accuracy and faster 
evaluation. Applying the proposed framework provides 
better data classification and better predictive accuracy than 
some benchmark classifiers. The 
predictive classifier (model) is implemented on Microsoft 
Azure ML Studio. 


In the second phase, a suitable split is applied for training and 
testing the model, for good convergence of the model. The 
best model for classifying the data is chosen in this phase 
by iteratively applying various models and scoring each 
model with test data on the basis of performance 
metrics. In this phase we applied three ML methods, 
including the proposed method. The phases of the model are 
depicted in Figure 3.1. The model is built on the following 
algorithms: 


1. Bayes Point Machine 

2. Logistic Regression 

3. Deep support vector machine 
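
The second-phase procedure (split the data, score each candidate on test data, keep the best) can be sketched in plain Python. This is not the Azure ML Studio experiment itself: the 70/30 split ratio, the random seed, the toy data and the two stand-in classifiers are assumptions, and model fitting on the training portion is omitted for brevity.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]      # (features, label)
Model = Callable[[Sequence[float]], int]  # maps features to a predicted label


def train_test_split(rows: List[Row], test_fraction: float = 0.3,
                     seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Shuffle the rows reproducibly and split them into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(model: Model, rows: List[Row]) -> float:
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def select_best(models: Dict[str, Model], test_rows: List[Row]) -> Tuple[str, float]:
    """Score every candidate on the test rows and return the best one."""
    scores = {name: accuracy(m, test_rows) for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical data: the label is 1 when the single feature exceeds 0.5.
rows: List[Row] = [([i / 10], int(i / 10 > 0.5)) for i in range(10)]
train_rows, test_rows = train_test_split(rows)

# Stand-ins for trained classifiers (fitting on train_rows is omitted here).
candidates: Dict[str, Model] = {
    "threshold_rule": lambda x: int(x[0] > 0.5),
    "always_zero": lambda x: 0,
}
best_name, best_score = select_best(candidates, test_rows)
print(best_name, best_score)
```

Accuracy is the only metric used in this sketch; the same loop could rank candidates by recall or any other metric.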


IV. RESULTS 


Microsoft Azure Machine Learning Studio (MAMLS) provides an ML 
Workspace with (a) ML Studio, 
(b) ML Gallery and (c) ML Web Service Management. 
Azure ML Studio is a graphical tool used to 
organize and conduct the process of ML model building, 
testing and deployment. It includes a collection of data 
pre-processing modules, a collection of ML algorithms, and 
an Azure ML API to deploy a model as an application on 
Azure. ML Studio allows a user to import new datasets, 
pre-processing methods, ML algorithms and more onto 
its workspace. 
Figure 4.1: Azure Machine Learning Components 


As Figures 4.1 and 4.2 suggest, Machine Learning 
Studio lets a user drag and drop datasets, data cleaning and 
pre-processing modules, machine learning algorithms 
and more onto its workspace. The user can connect these 
together and then execute the experiment. Once the 
model is built, the user can run the experiment to evaluate 
the model created. The user can use ML Studio to deploy this 
model to Microsoft Azure, where applications can use it. 
ML Studio provides a single tool for controlling the 
entire machine learning process. 
Figure 4.2: Models on Machine Learning Studio (Microsoft Azure) 


V. SIMULATION RESULTS AND ANALYSIS 


Many works have been proposed in this field in the past. The 
frameworks or classifiers proposed for these tasks are 
based on well-known data mining (DM) techniques and machine 
learning algorithms and give good accuracy rates. 
In this work, the accuracy is increased by 1%, 
and other metrics such as the correlation coefficient and MAE are 
improved. The model building time is also lower for 
our work than for the base model. 


The parameters along with result evaluation and analysis 
are presented below: 


Accuracy: The accuracy of a model is generally measured 
on the basis of correctly classified instances. The comparison 
is depicted in Figure 5.1. 


$$\text{Accuracy} = \frac{TP + TN}{\text{No. of Instances}}$$


True Positive: It represents the number of positive 
instances that are correctly identified as positive. 


Recall: It is also called sensitivity. It is defined as the 
proportion of positive cases that are correctly identified. 

$$\text{Recall} = \frac{TP}{TP + FN}$$
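
The accuracy and recall definitions can be computed directly from confusion-matrix counts. The following is a minimal sketch; the function names and the eight sample labels are hypothetical and are not results from this work.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for a binary classification task."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """(TP + TN) divided by the total number of instances."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); defined as 0.0 when there are no positive instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels: 1 = defaulter, 0 = non-defaulter.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fp, tn, fn))
print("recall:", recall(tp, fn))
```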

The results mentioned here are as per the simulation 
scenario shown in section 4.6. The results are shown 
in Table 5.1, with a graphical representation in Figure 5.1. For 
evaluation, the results of the base learners and the proposed 
model are compared (refer to Table 5.1). The accuracy and 
model building time are also improved in comparison to the 
base learners. 


Table 5.1: Comparison of Results 


Figure 5.1: Comparison of Results 